Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls

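Workflow Overview

This workflow contains 23 nodes (11 sticky notes and 12 functional nodes). It turns a Screaming Frog crawl export into an llms.txt file.
Workflow Source Code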

{
  "id": "",
  "meta": {
    "instanceId": "",
    "templateCredsSetupCompleted": true
  },
  "name": "Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls",
  "tags": [],
  "nodes": [
    {
      "id": "ca701618-b2d5-48ee-a503-d3513d018a65",
      "name": "Sticky Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        360,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Form - Screaming Frog internal_html.csv upload  

This form node is used to trigger the workflow.  

It contains **three input fields**:  
- Name of the website  
- Short description of the website  
- **Screaming Frog** export containing the internal URLs  



It is recommended to use the **internal_html.csv** export, but **internal_all.csv** will also work, as the workflow includes a filter to process only indexable URLs.
"
      },
      "typeVersion": 1
    },
    {
      "id": "bc040ca0-f38d-4458-a60c-17f71dbfd1ea",
      "name": "Sticky Note1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        780,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Extract data from Screaming Frog file

This node extracts data from the **CSV file** provided by the user.  

It produces an output that is **easily usable** in the following nodes.  

⚠️ **Caution:**  
If the uploaded file is **not** the expected Screaming Frog export, the workflow will still proceed but will likely **fail in the next steps** due to missing required fields.  

"
      },
      "typeVersion": 1
    },
    {
      "id": "f71a7d10-847d-48e7-8820-ec0c1e7ea055",
      "name": "Sticky Note2",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1200,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Set Useful Fields  

This node sets **7 key fields** from the Screaming Frog export:  

- `url` → from the **\"Address\"** column  
- `title` → from the **\"Title 1\"** column  
- `description` → from the **\"Meta Description 1\"** column  
- `statut` → from the **\"Status Code\"** column  
- `indexability` → from the **\"Indexability\"** column  
- `content_type` → from the **\"Content Type\"** column  
- `word_count` → from the **\"Word Count\"** column  


**Multi-language compatibility**  
If you're using Screaming Frog in **French, Italian, German, or Spanish**, the column names will be different.  
However, the workflow is designed to handle this, so it will **still work correctly**! 🥳
"
      },
      "typeVersion": 1
    },
    {
      "id": "6f6546b8-adeb-4998-ae19-d93525337eb7",
      "name": "Set useful fields",
      "type": "n8n-nodes-base.set",
      "position": [
        1340,
        60
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "0e7d4a06-83fc-4834-93fe-2e758cbe2307",
              "name": "url",
              "type": "string",
              "value": "={{ $json.Address || $json.Adresse || $json.Dirección || $json.Indirizzo }}"
            },
            {
              "id": "c82f4d4c-9d0b-4c7d-9647-5d0240b58643",
              "name": "title",
              "type": "string",
              "value": "={{ $json['Title 1'] || $json['Titolo 1'] || $json['Título 1'] || $json['Titel 1'] }}"
            },
            {
              "id": "abea81db-ce3b-4ac1-bd21-09ccfffb567a",
              "name": "description",
              "type": "string",
              "value": "={{ $json['Meta Description 1'] || $json['Meta description 1'] }}"
            },
            {
              "id": "2ca75d74-70f8-400b-b862-9da186135915",
              "name": "statut",
              "type": "string",
              "value": "={{ $json['Status Code'] || $json['Code HTTP'] || $json['Status-Code'] || $json['Código de respuesta'] || $json['Codice di stato']}}"
            },
            {
              "id": "754d3202-38b0-4d79-ba24-8078b3244307",
              "name": "indexability",
              "type": "string",
              "value": "={{ $json.Indexability || $json.Indexabilité || $json.Indicizzabilità || $json.Indexabilidad || $json.Indexierbarkeit}}"
            },
            {
              "id": "8bc6583d-bb34-4d22-b310-fe79bb8ac85d",
              "name": "content_type",
              "type": "string",
              "value": "={{ $json['Content Type'] || $json['Type de contenu'] || $json['Tipo di contenuto'] || $json['Tipo de contenido'] || $json['Inhaltstyp']}}"
            },
            {
              "id": "c874ba1a-769e-43d3-9555-8c9914ca9b76",
              "name": "word_count",
              "type": "string",
              "value": "={{ $json['Word Count'] || $json['Nombre de mots'] || $json['Conteggio delle parole'] || $json['Recuento de palabras'] || $json['Wortanzahl'] }}"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "1a9af7a0-d2d5-44cb-9770-2d5a1e5706f4",
      "name": "Text Classifier",
      "type": "@n8n/n8n-nodes-langchain.textClassifier",
      "disabled": true,
      "position": [
        2260,
        60
      ],
      "parameters": {
        "options": {},
        "inputText": "=url : {{ $json.url }}
title : {{ $json.title }}
description : {{ $json.description }}
word count : {{ $json.word_count }}",
        "categories": {
          "categories": [
            {
              "category": "useful_content",
              "description": "Pages that are likely to contain high-quality content, making them suitable for inclusion in a file that aids content discovery for an LLM. "
            },
            {
              "category": "other_content",
              "description": "Pages that should not be included (e.g., pagination, or low-value content)."
            }
          ]
        }
      },
      "typeVersion": 1
    },
    {
      "id": "74a4e378-4228-4142-92ca-e541efde2b15",
      "name": "OpenAI Chat Model",
      "type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
      "position": [
        2180,
        240
      ],
      "parameters": {
        "model": {
          "__rl": true,
          "mode": "list",
          "value": "gpt-4o-mini"
        },
        "options": {}
      },
      "credentials": {
        "openAiApi": {
          "id": "",
          "name": "OpenAi Connection"
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "63dc6cfe-bc73-43b5-8c7d-4f5fd6501d3b",
      "name": "No Operation, do nothing",
      "type": "n8n-nodes-base.noOp",
      "position": [
        2580,
        200
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "cb555b99-9e63-4b6b-a1fc-512b5467d666",
      "name": "Sticky Note3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1620,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Filter URLs 

This **filter node** is used to keep only the URLs that meet the following conditions:  
- `statut` = **200**  
- `indexability` = **indexable**  
- `content_type` contains **text/html**  


These filters are even **more useful** if the uploaded file is an **internal_all.csv** instead of an **internal_html.csv**.  

### **Tips:**  
You can **add more filters** to refine the URLs included in your `llms.txt` file.  

💡 **Examples:**  
- **Filter by word count** → Ensure pages contain **enough text content**.  
- **Filter by URL path** → Keep only **specific folders or categories** in the `llms.txt` file.  
- **Filter by meta description** → Exclude URLs **without a meta description**, as this field will be used in the `llms.txt` file to describe each piece of content.  
"
      },
      "typeVersion": 1
    },
    {
      "id": "e34e56e2-5cc8-4e50-bfb0-3aa2e5e04abf",
      "name": "Filter URLs",
      "type": "n8n-nodes-base.filter",
      "position": [
        1740,
        60
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "version": 2,
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "strict"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "cef4feaa-1c46-45b1-92b7-f5c2051b1dc5",
              "operator": {
                "type": "number",
                "operation": "equals"
              },
              "leftValue": "={{ Number($json.statut) }}",
              "rightValue": 200
            },
            {
              "id": "bb821656-9740-4da4-8aa9-f65ad098c470",
              "operator": {
                "type": "boolean",
                "operation": "true",
                "singleValue": true
              },
              "leftValue": "={{ [\"Indexable\", \"Indicizzabile\", \"Indexierbar\"].includes($json.indexability) }}",
              "rightValue": "={{ \"Indexable\" || \"Indicizzabile\" }}"
            },
            {
              "id": "5c93ddb8-8091-406a-bc04-fa14e8b73fb9",
              "operator": {
                "type": "string",
                "operation": "contains"
              },
              "leftValue": "={{ $json.content_type }}",
              "rightValue": "text/html"
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "b98f19a8-afd3-4d26-8063-dee3ee75055f",
      "name": "Sticky Note4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2040,
        -800
      ],
      "parameters": {
        "color": 2,
        "width": 740,
        "height": 1160,
        "content": "## Text Classifier

🚫 **This node is deactivated by default** in the template.  

You can **enable it** if you want to add a more **\"intelligent\" 🤓 filter** to refine the URLs included in the `llms.txt` file, helping LLMs discover and prioritize valuable content.

### How It Works:
This node has **two outputs**:  
- **`useful_content`** → Pages that are **likely to contain high-quality content**, making them suitable for inclusion in a file that **aids content discovery for an LLM**.  
- **`other_content`** → Pages that should **not** be included (e.g., pagination or low-value content).  


You can **modify the description** in the node to fine-tune the classification according to your needs.  

### Input Fields:
- **url** → `{{ $json.url }}`  
- **title** → `{{ $json.title }}`  
- **description** → `{{ $json.description }}`  
- **word_count** → `{{ $json.word_count }}`  

### Why use an LLM?  
A **language model (LLM)** can **analyze** the **URL, title, and description** to identify pages that **most likely contain meaningful and relevant content**.  
This allows it to **prioritize valuable pages** and structure the data for **better content discovery and training purposes**. 

### **For large websites**  
If you have a **very large website**, consider using a **Loop Over Items** node to make the workflow **more robust** and ensure all pages are processed.  
Also, using a **Loop Over Items** node makes it **easier** to handle:  
- **Timeouts** 
- **API quotas** 
- **Other scalability issues**

### Token usage
Finally, keep in mind that **more pages mean more tokens and more billed LLM API calls**.







"
      },
      "typeVersion": 1
    },
    {
      "id": "63e3ea7a-cec3-442c-9812-771def0a9949",
      "name": "Sticky Note5",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2840,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Set Field - llms.txt Row

This node **sets** the row format for the `llms.txt` file.  

### Row Structure:
Each row follows this format:  

- `- [title](link): description`  

If the URL **has no description** (from the **Meta Description** in the Screaming Frog export), the row will be:  

- `- [title](link)`  
"
      },
      "typeVersion": 1
    },
    {
      "id": "78f58220-feb5-4044-b994-39a0e4f1e9e4",
      "name": "Sticky Note6",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3260,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Summarize - Concatenate

This node concatenates all the output from the previous node, ensuring each row is on a separate line."
      },
      "typeVersion": 1
    },
    {
      "id": "7a119633-7cd3-4de5-a1cd-7f708e1abf4a",
      "name": "Sticky Note7",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3680,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Set Fields - llms.txt Content

This node sets the content of the `llms.txt` file using:

- The **website title** provided in the form (first node).
- The **website description** provided in the form (first node).
- The output from the previous node, which includes all the URLs, their titles, and their descriptions that will appear in the `llms.txt` file.
"
      },
      "typeVersion": 1
    },
    {
      "id": "554f6858-68e8-4b35-a6c4-21bed6832323",
      "name": "Sticky Note8",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        4100,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Generate llms.txt file

This node **creates** the `llms.txt` file, which can be **downloaded directly** within n8n. 
"
      },
      "typeVersion": 1
    },
    {
      "id": "24bdefba-e2f2-41f0-93e7-9f8d2fc11f43",
      "name": "Sticky Note9",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        4520,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## upload file anywhere

Instead of downloading the file directly from the n8n workflow, you can **replace this node** with a Drive node (e.g., **Google Drive** or **OneDrive**) to upload the `llms.txt` file to a folder of your choice.  
  
**Name the file properly** (e.g., include the website name) to make it easier to find and distinguish between files when working on multiple websites.  
"
      },
      "typeVersion": 1
    },
    {
      "id": "a3be51e3-810c-40a7-a996-98a3d383c2b9",
      "name": "Summarize - Concatenate",
      "type": "n8n-nodes-base.summarize",
      "position": [
        3380,
        40
      ],
      "parameters": {
        "options": {},
        "fieldsToSummarize": {
          "values": [
            {
              "field": "llmTxtRow",
              "separateBy": "\n",
              "aggregation": "concatenate"
            }
          ]
        }
      },
      "typeVersion": 1.1
    },
    {
      "id": "8d3a892a-3d11-4d8a-8ec6-84f8f3af1183",
      "name": "Set Fields - llms.txt Content",
      "type": "n8n-nodes-base.set",
      "position": [
        3820,
        40
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "97062a99-e944-4e1e-89b1-62cf9e3462dd",
              "name": "llmsTxtFile",
              "type": "string",
              "value": "=# {{ $('Form - Screaming frog internal_html.csv upload').item.json['What is the name of your website?'] }}
> {{ $('Form - Screaming frog internal_html.csv upload').item.json['Can you provide a short description of your website? (in the language of the website)'] }}

{{ $json.concatenated_llmTxtRow }}"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "bc2a692a-47ea-4bf1-a102-e607fd544158",
      "name": "upload file anywhere",
      "type": "n8n-nodes-base.noOp",
      "position": [
        4640,
        40
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "404510a2-35b2-44cf-9d02-eb0abcf4e9b3",
      "name": "Set Field - llms.txt Row",
      "type": "n8n-nodes-base.set",
      "position": [
        2960,
        40
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "95e75caa-8110-476b-9cb1-73c15361fa56",
              "name": "llmTxtRow",
              "type": "string",
              "value": "=- [{{ $json.title }}]({{ $json.url }}){{ $json.description ? ': ' + $json.description : '' }}"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "f54d51f2-17bc-4c58-b177-0e77e16f7b72",
      "name": "Sticky Note10",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -420,
        -1020
      ],
      "parameters": {
        "color": 5,
        "width": 700,
        "height": 1380,
        "content": "# Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls  

This workflow helps you generate an **llms.txt** file (if you're unfamiliar with it, check out [this article](https://towardsdatascience.com/llms-txt-414d5121bcb3/)) using a **Screaming Frog export**.  

[Screaming Frog](https://www.screamingfrog.co.uk/seo-spider/) is a well-known website crawler.  
Crawl a website with it, then export the **\"internal_html\"** section in CSV format.  

## How It Works: 

A **form** allows you to enter:  
- The **name of the website**  
- A **short description**  
- The **internal_html.csv** file from your Screaming Frog export  


Once the form is submitted, the **workflow is triggered automatically**, and you can **download the llms.txt file directly from n8n**. 

## Downloading the File
Since the last node in this workflow is **\"Convert to File\"**, you will need to **download the file directly from the n8n UI**.  
However, you can easily **add a node** (e.g., Google Drive, OneDrive) to automatically upload the file **wherever you want**.  

## AI-Powered Filtering (Optional):  
This workflow includes a **text classifier node**, which is **deactivated by default**.  
- You can **activate it** to apply a more **intelligent filter** to select URLs for the `llms.txt` file.  
- Consider modifying the **description** in the classifier node to specify the type of URLs you want to include.  

## How to Use This Workflow  

1. **Crawl the website** you want to generate an `llms.txt` file for using **Screaming Frog**.  
2. **Export the \"internal_html\"** section in CSV format.  
   ![Screaming Frog internal html export](https://i.imgur.com/M0nJQiV.png)  
3. In **n8n**, click **\"Test Workflow\"**, fill in the form, and **upload** the `internal_html.csv` file.  
4. Once the workflow is complete, go to the **\"Generate llms.txt file\"** node and **download the output**.  

**That's it! You now have your llms.txt file!**  



**Recommended Usage:**  
Use this workflow **directly in the n8n UI by clicking** 'Test Workflow' and uploading the file in the form."
      },
      "typeVersion": 1
    },
    {
      "id": "e33104af-802a-43f2-b26d-f368f7de2fd7",
      "name": "Form - Screaming frog internal_html.csv upload",
      "type": "n8n-nodes-base.formTrigger",
      "position": [
        460,
        60
      ],
      "webhookId": "8791f39a-3d81-405c-b177-0a733ebf74cb",
      "parameters": {
        "options": {
          "buttonLabel": "Get the llms.txt file"
        },
        "formTitle": "llms.txt Generator - From Screaming Frog export",
        "formFields": {
          "values": [
            {
              "fieldLabel": "What is the name of your website?",
              "placeholder": "Example : The best website ever",
              "requiredField": true
            },
            {
              "fieldLabel": "Can you provide a short description of your website? (in the language of the website)",
              "placeholder": "Example : This is the best website ever because all the content is engaging and valuable.",
              "requiredField": true
            },
            {
              "fieldType": "file",
              "fieldLabel": "screaming_frog_export",
              "multipleFiles": false,
              "requiredField": true,
              "acceptFileTypes": ".csv"
            }
          ]
        },
        "responseMode": "lastNode",
        "formDescription": "Generate a simple llms.txt file from a Screaming Frog Export
It is recommended to use the internal_html.csv export, although internal_all.csv will also work.

Just fill in the fields in this form 😄"
      },
      "typeVersion": 2.2
    },
    {
      "id": "f6b17fdd-a098-411e-8d53-3f6e638cc3ba",
      "name": "Extract data from Screaming Frog file",
      "type": "n8n-nodes-base.extractFromFile",
      "position": [
        900,
        60
      ],
      "parameters": {
        "options": {},
        "operation": "xls",
        "binaryPropertyName": "screaming_frog_export"
      },
      "typeVersion": 1
    },
    {
      "id": "6bbd8d1f-3322-4c6d-af08-c842386239ce",
      "name": "Generate llms.txt file",
      "type": "n8n-nodes-base.convertToFile",
      "position": [
        4220,
        40
      ],
      "parameters": {
        "options": {
          "encoding": "utf8",
          "fileName": "llms.txt"
        },
        "operation": "toText",
        "sourceProperty": "llmsTxtFile"
      },
      "typeVersion": 1.1
    }
  ],
  "active": false,
  "pinData": {},
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "",
  "connections": {
    "Filter URLs": {
      "main": [
        [
          {
            "node": "Text Classifier",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Text Classifier": {
      "main": [
        [
          {
            "node": "Set Field - llms.txt Row",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "No Operation, do nothing",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "OpenAI Chat Model": {
      "ai_languageModel": [
        [
          {
            "node": "Text Classifier",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Set useful fields": {
      "main": [
        [
          {
            "node": "Filter URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Generate llms.txt file": {
      "main": [
        []
      ]
    },
    "Summarize - Concatenate": {
      "main": [
        [
          {
            "node": "Set Fields - llms.txt Content",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Set Field - llms.txt Row": {
      "main": [
        [
          {
            "node": "Summarize - Concatenate",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Set Fields - llms.txt Content": {
      "main": [
        [
          {
            "node": "Generate llms.txt file",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract data from Screaming Frog file": {
      "main": [
        [
          {
            "node": "Set useful fields",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Form - Screaming frog internal_html.csv upload": {
      "main": [
        [
          {
            "node": "Extract data from Screaming Frog file",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
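The "Set useful fields" and "Filter URLs" nodes in the JSON above can be approximated outside n8n. The following is a minimal sketch, not the workflow's actual runtime: the column names and filter conditions are copied from the node parameters, while `COLUMN_ALIASES`, `pickFirst`, `normalizeRow`, `isEligible`, and the sample rows are hypothetical names and data introduced for illustration.

```javascript
// Candidate column names per field, copied from the "Set useful fields" node.
const COLUMN_ALIASES = {
  url: ["Address", "Adresse", "Dirección", "Indirizzo"],
  title: ["Title 1", "Titolo 1", "Título 1", "Titel 1"],
  description: ["Meta Description 1", "Meta description 1"],
  statut: ["Status Code", "Code HTTP", "Status-Code", "Código de respuesta", "Codice di stato"],
  indexability: ["Indexability", "Indexabilité", "Indicizzabilità", "Indexabilidad", "Indexierbarkeit"],
  content_type: ["Content Type", "Type de contenu", "Tipo di contenuto", "Tipo de contenido", "Inhaltstyp"],
  word_count: ["Word Count", "Nombre de mots", "Conteggio delle parole", "Recuento de palabras", "Wortanzahl"],
};

// Return the first truthy value among the candidate columns
// (similar to the `||` chains in the node, but yields undefined when nothing matches).
function pickFirst(row, names) {
  for (const name of names) {
    if (row[name]) return row[name];
  }
  return undefined;
}

// Map one CSV row (any supported export language) to the seven workflow fields.
function normalizeRow(row) {
  const out = {};
  for (const [field, names] of Object.entries(COLUMN_ALIASES)) {
    out[field] = pickFirst(row, names);
  }
  return out;
}

// Labels accepted by the "Filter URLs" node.
const INDEXABLE_LABELS = ["Indexable", "Indicizzabile", "Indexierbar"];

// Same three conditions as the "Filter URLs" node, combined with AND.
function isEligible(page) {
  return (
    Number(page.statut) === 200 &&
    INDEXABLE_LABELS.includes(page.indexability) &&
    String(page.content_type ?? "").includes("text/html")
  );
}

// Hypothetical sample rows: one French export row and one English redirect row.
const sampleRows = [
  {
    "Adresse": "https://example.com/",
    "Title 1": "Accueil",
    "Code HTTP": "200",
    "Indexabilité": "Indexable",
    "Type de contenu": "text/html; charset=UTF-8",
  },
  {
    "Address": "https://example.com/old",
    "Title 1": "Old page",
    "Status Code": "301",
    "Indexability": "Non-Indexable",
    "Content Type": "text/html",
  },
];

const eligiblePages = sampleRows.map(normalizeRow).filter(isEligible);
console.log(eligiblePages.map((page) => page.url));
```

Only the first sample row passes all three conditions. The second is rejected on its status code alone.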
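Features

Form-triggered upload of a Screaming Frog CSV export
Column mapping for English, French, Italian, German, and Spanish exports
Filtering by status code, indexability, and content type
Optional AI classification of pages (Text Classifier node, disabled by default)
Downloadable llms.txt output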
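The last three data nodes of the workflow above ("Set Field - llms.txt Row", "Summarize - Concatenate", and "Set Fields - llms.txt Content") build the file text. A minimal sketch of that assembly follows, assuming the pages are already filtered. The row and file formats come from the node expressions; `toLlmsTxtRow`, `buildLlmsTxt`, and the example site are hypothetical.

```javascript
// Mirrors the "Set Field - llms.txt Row" expression:
// "- [title](url)" plus ": description" when a description exists.
function toLlmsTxtRow(page) {
  const suffix = page.description ? ": " + page.description : "";
  return `- [${page.title}](${page.url})${suffix}`;
}

// Mirrors "Summarize - Concatenate" (rows joined by newlines) and
// "Set Fields - llms.txt Content" (heading, blockquote, blank line, rows).
function buildLlmsTxt(siteName, siteDescription, pages) {
  const rows = pages.map(toLlmsTxtRow).join("\n");
  return `# ${siteName}\n> ${siteDescription}\n\n${rows}`;
}

// Hypothetical input standing in for the form values and the filtered pages.
const llmsTxt = buildLlmsTxt("Example Site", "A demo website.", [
  { title: "Home", url: "https://example.com/", description: "Welcome page" },
  { title: "Blog", url: "https://example.com/blog", description: "" },
]);

console.log(llmsTxt);
```

The second page has an empty description, so its row ends right after the link, as described in the "Set Field - llms.txt Row" sticky note.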
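Technical Analysis

Node Types and Roles

Sticky Note (n8n-nodes-base.stickyNote): documents each step on the canvas
Set (n8n-nodes-base.set): maps the export columns and builds the llms.txt rows and content
Text Classifier (@n8n/n8n-nodes-langchain.textClassifier): optional AI filter, disabled by default
OpenAI Chat Model (@n8n/n8n-nodes-langchain.lmChatOpenAi): language model used by the Text Classifier
No Operation (n8n-nodes-base.noOp): placeholder for discarded pages and for the upload step
Complexity Assessment

Configuration difficulty:
★★★★☆
Maintenance difficulty:
★★☆☆☆
Extensibility:
★★★★☆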
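Implementation Guide

Prerequisites

Access to an n8n instance
A Screaming Frog crawl exported as internal_html.csv (internal_all.csv also works)
OpenAI API credentials, only if you enable the Text Classifier node
Configuration Steps

Import the workflow JSON file into n8n
If you enable the Text Classifier node, configure the OpenAI credentials on the OpenAI Chat Model node
Optionally adjust the category descriptions in the Text Classifier node
Click "Test Workflow", fill in the form, and upload the CSV export
Download the llms.txt file from the "Generate llms.txt file" node
Optionally replace the "upload file anywhere" node with a Google Drive or OneDrive node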
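Key Parameters

Parameter	Value in template	Description
acceptFileTypes	.csv	File type accepted by the upload field of the form
model	gpt-4o-mini	Model selected in the OpenAI Chat Model node
fileName	llms.txt	Name of the file produced by the "Generate llms.txt file" node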
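Best Practices

Optimization Tips

Filter by word count so that only pages with enough text content are kept
Filter by URL path to keep only specific folders or categories
Exclude URLs without a meta description, since that field describes each entry in llms.txt
Adjust the category descriptions in the Text Classifier node to match the pages you want to include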
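The "Filter URLs" sticky note in the workflow suggests extra filters on word count, URL path, and meta description. These could be sketched as follows, assuming pages already carry the `url`, `word_count`, and `description` fields set by the workflow. `makeExtraFilter`, `pathOf`, the option names, and the thresholds are hypothetical and not part of the template.

```javascript
// Return the path of an absolute URL, or null when the URL cannot be parsed.
function pathOf(url) {
  try {
    return new URL(url).pathname;
  } catch {
    return null;
  }
}

// Build a predicate from optional criteria; every criterion is off by default.
function makeExtraFilter({ minWords = 0, pathPrefix = "", requireDescription = false } = {}) {
  return (page) => {
    // word_count arrives as a string in the workflow, so convert it first.
    const words = Number(page.word_count) || 0;
    if (words < minWords) return false;

    if (pathPrefix) {
      const path = pathOf(page.url);
      if (path === null || !path.startsWith(pathPrefix)) return false;
    }

    if (requireDescription && !page.description) return false;
    return true;
  };
}

// Hypothetical pages used to exercise each criterion.
const candidatePages = [
  { url: "https://example.com/blog/post-1", word_count: "850", description: "A long post" },
  { url: "https://example.com/blog/post-2", word_count: "50", description: "A short post" },
  { url: "https://example.com/about", word_count: "400", description: "About us" },
  { url: "https://example.com/blog/post-3", word_count: "600", description: "" },
];

const keepBlogPosts = makeExtraFilter({ minWords: 200, pathPrefix: "/blog/", requireDescription: true });
const keptPages = candidatePages.filter(keepBlogPosts);
console.log(keptPages.map((page) => page.url));
```

In n8n itself, the same conditions would normally be added to the Filter node rather than written as code. The sketch only makes the logic explicit.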
Security Considerations

Store API keys and credentials securely
Restrict access to the workflow
Review execution logs regularly
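Performance Optimization

For very large websites, add a Loop Over Items node so that pages are processed in batches
Batching makes timeouts and API quotas easier to handle
Keep in mind that more pages mean more tokens and more billed LLM API calls when the Text Classifier is enabled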
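The Text Classifier sticky note recommends a Loop Over Items node for large websites. The idea behind it is plain batching, sketched below outside n8n. `chunk`, `classifyInBatches`, the batch size, the pause, and the stub classifier are all hypothetical; the stub stands in for a real LLM call and labels every page `useful_content`.

```javascript
// Split an array into consecutive batches of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Process batches one after another, optionally pausing between them
// to stay under an API quota.
async function classifyInBatches(pages, classifyBatch, { batchSize = 20, pauseMs = 0 } = {}) {
  const results = [];
  for (const batch of chunk(pages, batchSize)) {
    results.push(...(await classifyBatch(batch)));
    if (pauseMs > 0) {
      await new Promise((resolve) => setTimeout(resolve, pauseMs));
    }
  }
  return results;
}

// Stub classifier (assumption): a real one would call the language model.
const stubClassifier = async (batch) =>
  batch.map((page) => ({ ...page, category: "useful_content" }));

const demoPages = [1, 2, 3, 4, 5].map((n) => ({ url: `https://example.com/page-${n}` }));

classifyInBatches(demoPages, stubClassifier, { batchSize: 2 }).then((classified) => {
  console.log(classified.length, "pages classified");
});
```

Sequential batches keep the number of in-flight requests bounded. If one batch fails, the pages already classified are not lost, provided the caller catches the error.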
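Troubleshooting

Common Issues

The workflow fails after the extraction step

Check that the uploaded file is a Screaming Frog internal_html.csv or internal_all.csv export. Other files lack the columns that the following nodes require.
The llms.txt file contains no pages

The Filter URLs node keeps only pages with status code 200 that are indexable and whose content type contains text/html. Check these columns in the export, and check that the export language is English, French, Italian, German, or Spanish.
Debugging Tips

Inspect the output of each node in the n8n UI
Test with a small export first
If the Text Classifier is enabled, check the OpenAI API status and quota
Run the workflow step by step to locate the failing node
Error Handling

The workflow JSON configures no retries, error branches, or alerts.
When the Text Classifier is enabled, pages classified as other_content are routed to the "No Operation, do nothing" node and left out of llms.txt.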